Abstract

The requirement of large sample sizes for calibrating items based on IRT models is not easily 
met in many practical pretesting situations. Although classical item statistics could be estimated with 
much smaller samples, the values may not be comparable across different groups of examinees. This 
study presented and evaluated a method of standardization that may be used by test practitioners to 
standardize classical item statistics when sample sizes are small. The effectiveness of this 
standardization approach was compared with the 1PL and 3PL models based on the criteria of the 
Pearson product-moment correlation, the MSE, variance and squared bias. 

In estimating the item difficulty values, the differences in performance between
the 3PL and standardization methods were small, but the differences between the 1PL and these two
methods were large. For the estimation of point biserial correlations, the 3PL model seemed to
perform better than the standardization method, and the standardization method performed better than 
the 1PL model. Although the standardization method did not outperform the 3PL model for the 
design considered in this study, it could be promising when smaller sample sizes are used. This 
method may be recommended for use in conjunction with the IRT models for the test development 
when the pretesting sample sizes are small. By employing the classical measurement framework to 
obtain pretest item statistics, the problem of inaccurate IRT parameter estimates when limited 
calibration sample sizes are available can be avoided. 

This study is designed to compare the effectiveness of a standardization method with IRT
methods for scaling pretest item statistics to be on the same scale. When item responses are obtained 
from different groups of small sample sizes, employing traditional IRT item calibration or scaling 
may not be justified due to the large sample size requirement. This study is intended to explore the 
method of standardization in adjusting the item statistics obtained from various groups in pretesting. 
Specifically, this study attempts to achieve the following objectives: 

1. to examine the effectiveness of a standardization method;

2. to compare the item statistics recovery of the 1PL model, the 3PL model and the standardization 
method with small sample sizes. 



Theoretical Framework 

The problem arises when a large number of items are to be pretested but concerns about test 
security prevent all pretest items from being administered to the same group of examinees. To 
maintain test security each item should be seen by the smallest number of examinees possible while 
still obtaining good item statistics. To achieve this goal, the pretest items can be included in parallel 
forms and administered to different small groups of examinees. It is not easy to use randomly
equivalent groups, however, because the groups used for pretesting are often formed by convenience
based on school, and concerns about item exposure preclude administering different forms within the
same school using a spiraling process.

Based on specific groups of the examinees that are administered the test, classical item 
statistics such as item difficulty (i.e., the p-value) and the item discrimination (i.e., the biserial or 
point biserial correlation) are computed. A sample size of 150 to 200 examinees is usually sufficient 
to obtain stable estimates of these statistics. However, the classical item statistics based on different 
groups are not directly comparable, posing a problem in the test development. Instead of directly 
computing classical item statistics, item parameters of IRT models can be estimated using the 
response data. These parameters in each group can be converted to be on the same scale using IRT
scaling or transformation methods. Estimates of the classical item statistics for all items in a 
particular group can be computed using the IRT parameter estimates. 

To achieve adequate precision in the item parameter estimates obtained using IRT, the
number of examinees used for calibration is required to be moderately large (Hambleton, Jones, & 
Rogers, 1993; Tsutakawa & Johnson, 1990). The requirement of large sample sizes in practical 
pretesting situations can be hard to meet. Depending on the specific testing situation (e.g., the 
number and nature of the items on the test, the distribution of the examinees’ abilities) and the 
particular IRT model chosen, the minimum number of examinees recommended for accurate item
parameter estimation varies (Barnes & Wise, 1991). In general, greater numbers of items and more 
complex IRT models require larger sample sizes. 

The Rasch model, or the one-parameter logistic (1PL) model, specifies that the item difficulty 
is the only item characteristic that varies from item to item, holding the item discrimination values 
equal for all items. Because there is only one parameter to be estimated, this model does not require 
large sample sizes. Previous research suggested that a sample of as few as 200 examinees
would be sufficient to accurately estimate item parameters of the 1PL model (Wright & Stone, 1979). 
However, the 1PL model may not provide a good fit to multiple-choice items where discrimination 
indices are usually unequal, and examinees are likely to guess on the items. 

The three-parameter logistic (3PL) model (Birnbaum, 1968) is a more general model where 
the discriminating power is allowed to vary among items and guessing is allowed to occur for the 
examinees. However, in order to accurately estimate the 3PL item parameters, previous research 
suggested that at least 1,000 (Reckase, 1979; Skaggs & Lissitz, 1986) to 10,000 (Thissen & Wainer, 
1982) examinees would be needed. Estimating IRT item discrimination parameters requires larger
sample sizes than estimating item difficulty parameters (Barnes & Wise, 1991). 

As an alternative to IRT item calibration with small sample sizes, this study proposes a
standardization approach to adjust conventional item statistics, which may perform better than IRT
methods with small sample sizes. The purpose of using the standardization method is to adjust the 
item statistics obtained from small nonequivalent samples to more closely represent the item statistics 
in the population of interest. The idea is similar to that of the direct standardization described in 
Mosteller and Tukey (1977). This method is implemented by incorporating a set of common items 
across the various forms of a test and using the assumption that the conditional distributions of 
unique or noncommon items given the number correct score on the common items are the same 
across all groups of examinees. A joint distribution of a unique item score and the number correct 
common item score in the total group of examinees can then be obtained. Specifically, the 
standardization method is described as follows. 

Let Ug and Xcg be random variables representing the score on a unique item and the number
correct score on a set of m common items, respectively, in a subpopulation of examinees, g. The
joint distribution of the unique item and common item scores in the subpopulation g can be
formulated as

$$\Pr(U_g = u,\ Xc_g = x) = \begin{cases} \Pr(U_g = u \mid Xc_g = x)\,\Pr(Xc_g = x), & g = 1, \ldots, G;\ u = 0, 1;\ x = 0, 1, \ldots, m, \\ 0, & \text{elsewhere.} \end{cases}$$

The notation Ug = u represents the random variable Ug taking on the value u, with the
value 1 for a correct response and 0 for an incorrect response in the subpopulation g, and Xcg = x
represents the random variable Xcg being equal to the number correct score x in the subpopulation g,
with values from 0 to the number of common items, m. Pr(Ug=u|Xcg=x) is the conditional
distribution of the unique item score given the common item score, and Pr(Xcg=x) is the marginal
distribution of the common item score in the subpopulation g.

For the entire population o (i.e., the examinees across all subpopulations), the joint 
distribution of the unique item and common item scores can be represented as 

$$\Pr(U_o = u,\ Xc_o = x) = \begin{cases} \Pr(U_o = u \mid Xc_o = x)\,\Pr(Xc_o = x), & u = 0, 1;\ x = 0, 1, \ldots, m, \\ 0, & \text{elsewhere.} \end{cases}$$

Based on the assumption that the conditional distributions of the unique item response given common 
item score are the same across all groups, the above joint distribution can be written as 

$$\Pr(U_o = u,\ Xc_o = x) = \Pr(U_g = u \mid Xc_g = x)\,\Pr(Xc_o = x) \quad \text{for any subpopulation } g,$$

where Pr(Xco=x), the distribution of the common item scores for the entire population o, is simply 
obtained based on the responses of all groups to the set of common items. Using this joint 
distribution of the entire population, estimates of classical item statistics that would have been 
obtained if the item had been given to a sample from the entire population can be calculated. An 
estimate of the classical p-value in the entire population (i.e., the average probability of correctly 
answering an item in the population of examinees) can be obtained from the joint distribution 
Pr(Uo=u, Xco=x) by summing over the common item scores for the correct unique item response.
The point biserial correlation between the unique item score and common item score can also be
obtained from this joint distribution of the unique item and common item scores in the population
of examinees.
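
The standardization step described above can be illustrated with a minimal Python sketch. It is only an unsmoothed illustration under stated assumptions, not the procedure as implemented in this study: the function names, the layout of the count tables, and the renormalization applied when a group has no examinees at some common item score are all hypothetical choices made for the example.

```python
from math import sqrt

def standardized_joint(group_counts, pooled_common_counts):
    """Estimate Pr(Uo=u, Xco=x) for one unique item.

    group_counts[u][x]: examinees in the group that took the item, with
        unique item score u (0 or 1) and common item score x (0..m).
    pooled_common_counts[x]: examinees across ALL groups with common score x.
    """
    m1 = len(pooled_common_counts)
    total = sum(pooled_common_counts)
    pr_x = [n / total for n in pooled_common_counts]  # Pr(Xco = x)
    joint = [[0.0] * m1 for _ in range(2)]
    for x in range(m1):
        n_x = group_counts[0][x] + group_counts[1][x]
        if n_x == 0:
            continue  # conditional distribution undefined without smoothing
        for u in (0, 1):
            # Pr(Ug=u | Xcg=x) * Pr(Xco=x)
            joint[u][x] = (group_counts[u][x] / n_x) * pr_x[x]
    # Assumption of this sketch: renormalize if some score levels were skipped.
    mass = sum(joint[0]) + sum(joint[1])
    return [[v / mass for v in row] for row in joint]

def p_value(joint):
    """Sum the joint distribution over x for the correct response (u = 1)."""
    return sum(joint[1])

def point_biserial(joint):
    """Correlation between unique item score and common item score."""
    m1 = len(joint[0])
    p = sum(joint[1])
    ex = sum(x * (joint[0][x] + joint[1][x]) for x in range(m1))
    ex2 = sum(x * x * (joint[0][x] + joint[1][x]) for x in range(m1))
    eux = sum(x * joint[1][x] for x in range(m1))
    return (eux - p * ex) / sqrt((p - p * p) * (ex2 - ex * ex))

# Toy data with m = 2 common items (purely illustrative counts).
group = [[4, 2, 1],   # u = 0 at x = 0, 1, 2
         [1, 2, 4]]   # u = 1 at x = 0, 1, 2
pooled = [10, 4, 6]   # common item score counts over all groups
joint = standardized_joint(group, pooled)
print(round(p_value(joint), 4), round(point_biserial(joint), 4))
```

In this toy example the raw p-value in the group is 7/14 = .50, while the standardized value is lower because the pooled population has relatively more examinees with low common item scores than this group does.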

Random error in this standardization method can be reduced using smoothing methods. A
bivariate polynomial log-linear model analogous to that described in Hanson (1991) and Rosenbaum
and Thayer (1987) can be employed to smooth the joint distribution of the unique and common items
Pr(Ug=u, Xcg=x). The marginal distribution of the common items Pr(Xco=x) can be smoothed
using a univariate polynomial log-linear model described in Kolen (1991).

Method and Data 

Simulations were employed to carry out this study. Described below are the data, the 
simulation procedure, and the criteria used for the data analyses. 

The Test 

Ten forms were created from ACT Assessment (ACT, 1997) Mathematics items to be as 
parallel as possible in their content and statistical specifications. Each form consisted of 24 unique 
items and 12 common items. In each form the common items followed the unique items.
Three-parameter logistic model item parameters were calibrated from multiple forms of the ACT
Assessment Mathematics Test using BILOG (Mislevy & Bock, 1990). The data used for the
calibration were randomly equivalent groups of examinees taking each form of the ACT Assessment, 
so all item parameter estimates were on a common scale. These calibrated item parameters were 
treated as the population item parameters in this study. 

Table 1 presents summary statistics of the item parameters a, b and c for the various forms,
where a, b and c represent the item discrimination parameter, the item difficulty parameter and the
guessing parameter, respectively. The N column lists the number of common or unique items.

The Samples and Population 

Ten groups of examinees were generated, with a sample size of 250 examinees in each group 
to represent a situation where the group size was small. The ability distribution of the examinees in 
each group was generated based on a normal distribution. The current study was intended to simulate 
a situation where the various groups vary in their ability averages since randomly equivalent groups 
cannot be easily obtained using spiraling in real pretesting situations. Meanwhile, the ability mean 
differences that are expected among the groups could be small because efforts can still be made to 
mitigate the differences during sampling. In this study, the variances were set to 1.0 for all groups,
and the means were specified to be -0.4, -0.3, ..., 0.5 for groups 1 through 10, respectively. That is,
the abilities of the various groups were distributed as N(-0.4, 1), N(-0.3, 1), ..., N(0.5, 1), respectively.
The entire sample of all ten groups was the chosen population in this study. Thus, the population
ability distribution was a mixture of the ten normal distributions. The density of the mixture
distribution at any θ point was a weighted sum of the ten densities, using a weight of .10 for each density.

Based on the ability distributions of the examinees in the groups and the population item 
parameters, random responses of the items were generated for the various groups. Every examinee in 
each of the ten groups responded to 24 unique items and the set of 12 common items. 
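
The data generation just described can be sketched as follows. This is a minimal illustration rather than the simulation code used in the study: the item parameters are placeholder values (the actual population parameters are summarized in Table 1), the scaling constant D = 1.7 in the logistic function is an assumption, and the function names are hypothetical.

```python
import random
from math import exp

def icc_3pl(theta, a, b, c, D=1.7):
    """3PL probability of a correct response; D = 1.7 is an assumed constant."""
    return c + (1.0 - c) / (1.0 + exp(-D * a * (theta - b)))

def simulate_group(mean, items, n=250, sd=1.0, rng=random):
    """Draw n abilities from N(mean, sd^2) and generate 0/1 item responses."""
    data = []
    for _ in range(n):
        theta = rng.gauss(mean, sd)
        row = [1 if rng.random() < icc_3pl(theta, a, b, c) else 0
               for (a, b, c) in items]
        data.append(row)
    return data

rng = random.Random(2024)  # fixed seed so the sketch is reproducible
# Group ability means -0.4, -0.3, ..., 0.5 for groups 1 through 10.
group_means = [round(-0.4 + 0.1 * g, 1) for g in range(10)]
# Placeholder (a, b, c) values: 24 unique items followed by 12 common items.
items = [(1.0, 0.0, 0.2)] * 36
groups = [simulate_group(mu, items, rng=rng) for mu in group_means]
print(len(groups), len(groups[0]), len(groups[0][0]))
```

In the study each group would receive its own 24 unique items plus the shared 12 common items; the sketch reuses one placeholder item list for all groups only to keep the example short.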



The Population Item Statistics 

The population conventional item statistics (both the p-values and point biserial correlations) 
were obtained based on the population item parameters and the population ability distribution. The 
population p-value of an item was computed by evaluating the integral 

$$p = \int \Pr(U = 1 \mid \theta)\, f_o(\theta)\, d\theta,$$

where Pr(U=1|θ) is the conditional probability of the correct item response given a particular θ value
and is calculated using the item characteristic curve (ICC)

$$\Pr(U = 1 \mid \theta) = c + (1 - c)\,\frac{1}{1 + e^{-1.7a(\theta - b)}},$$

where a, b and c are parameters that characterize the item. For the 3PL model, all three
parameters are considered, while for the 1PL model the a parameter is a constant and the c parameter
is 0 for all items. The distribution fo(θ) is the ability distribution of the overall population.
Specifically, the population p-value of an item was derived by approximating the continuous
distribution fo(θ) with a discrete distribution at θ levels equally spaced over the interval from -5.0 to
5.0 with an increment of .10 (i.e., -5.0, -4.9, ..., 5.0), totaling 101 points. That is, the integral was
replaced by the sum over these discrete θ points, with fo(θ) being the discrete probabilities of the θ
points.

Because the various forms of the test consisted of different unique items and a set of common
items, the population point biserial correlation was defined in this study as the correlation between
the unique item score and common item score instead of the total test score. The purpose of this
study was to produce item statistics that were comparable across groups. Point biserial
correlations were computed between the unique item scores and common item scores so that all items
were correlated with a common variable, making the correlations more comparable across items. The
population point biserial correlation was computed based on the joint distribution of the unique item
and the common items in the overall population, Pr(Uo=u, Xco=x). This joint distribution is given by

$$\Pr(U_o = u,\ Xc_o = x) = \int \Pr(U = u,\ Xc = x \mid \theta)\, f_o(\theta)\, d\theta = \int \Pr(U = u \mid \theta)\,\Pr(Xc = x \mid \theta)\, f_o(\theta)\, d\theta,$$

where Pr(U=1|θ) was calculated using the ICC and Pr(U=0|θ) is 1 - Pr(U=1|θ). Pr(Xc=x|θ), the
distribution of the number correct common item scores conditioned on a particular θ value, was
obtained by the Lord-Wingersky algorithm (Lord & Wingersky, 1984). In this study, there were 13
possible values of Pr(Xc=x|θ) for a particular θ, one for each of the common item scores of 0, 1, ..., 12.

To derive this joint distribution Pr(Uo=u, Xco=x), the continuous distribution fo(θ) was also
approximated with a discrete distribution at θ levels equally spaced over the interval from -5.0 to
5.0 with an increment of .10. That is, the integral was replaced by the sum over these discrete levels,
with fo(θ) being the discrete probabilities of the θ points.
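
The discrete approximation of the population p-value and of the joint distribution, together with the Lord-Wingersky recursion, can be sketched in Python as below. This is an illustrative sketch under assumptions, not the study's code: the scaling constant D = 1.7, the normalization of the mixture densities so that the 101 weights sum to 1, the placeholder item parameters, and all function names are choices made for the example.

```python
from math import exp, pi, sqrt

def icc_3pl(theta, a, b, c, D=1.7):
    """3PL item characteristic curve; D = 1.7 is an assumed constant."""
    return c + (1.0 - c) / (1.0 + exp(-D * a * (theta - b)))

def normal_pdf(x, mean, sd=1.0):
    return exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * sqrt(2.0 * pi))

# 101 equally spaced theta points: -5.0, -4.9, ..., 5.0
GRID = [-5.0 + 0.1 * k for k in range(101)]
# Group ability means -0.4, -0.3, ..., 0.5
MEANS = [-0.4 + 0.1 * g for g in range(10)]

def mixture_weights(grid=GRID, means=MEANS):
    """Discrete probabilities for the mixture of ten normals (weight .10 each)."""
    dens = [sum(0.10 * normal_pdf(t, mu) for mu in means) for t in grid]
    total = sum(dens)  # assumption: rescale so the weights sum to 1
    return [d / total for d in dens]

def lord_wingersky(probs):
    """Distribution of the number correct score given item probabilities."""
    dist = [1.0]  # with zero items, score 0 has probability 1
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for x, mass in enumerate(dist):
            new[x] += mass * (1.0 - p)   # item answered incorrectly
            new[x + 1] += mass * p       # item answered correctly
        dist = new
    return dist

def population_p_value(item, weights, grid=GRID):
    """Replace the integral of Pr(U=1|theta) f_o(theta) by a weighted sum."""
    return sum(w * icc_3pl(t, *item) for t, w in zip(grid, weights))

def population_joint(unique_item, common_items, weights, grid=GRID):
    """Pr(Uo=u, Xco=x), using conditional independence given theta."""
    m = len(common_items)
    joint = [[0.0] * (m + 1) for _ in range(2)]
    for t, w in zip(grid, weights):
        p1 = icc_3pl(t, *unique_item)
        px = lord_wingersky([icc_3pl(t, *it) for it in common_items])
        for x in range(m + 1):
            joint[1][x] += w * p1 * px[x]
            joint[0][x] += w * (1.0 - p1) * px[x]
    return joint

weights = mixture_weights()
unique = (1.0, 0.0, 0.2)          # placeholder (a, b, c)
common = [(1.0, 0.0, 0.2)] * 12   # placeholder common items
joint = population_joint(unique, common, weights)
print(round(population_p_value(unique, weights), 4), len(joint[0]))
```

Summing the u = 1 row of the joint distribution over the 13 common item scores returns the same population p-value as the direct weighted sum, which is a convenient consistency check.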

The correlation based on values of this bivariate distribution Pr(Uo=u, Xco=x) was the
population point biserial correlation of interest in this study, computed as

$$\rho = \frac{\sum_x \sum_u u\,x\,\Pr(U_o = u,\ Xc_o = x) - \sum_u u\,\Pr(U_o = u)\,\sum_x x\,\Pr(Xc_o = x)}{\sqrt{\sum_u u^2\,\Pr(U_o = u) - \left[\sum_u u\,\Pr(U_o = u)\right]^2}\;\sqrt{\sum_x x^2\,\Pr(Xc_o = x) - \left[\sum_x x\,\Pr(Xc_o = x)\right]^2}},$$

where Pr(Uo=u, Xco=x) is the joint distribution of the unique item and common item scores and
Pr(Uo=u) and Pr(Xco=x) are the marginal distributions of the unique item and common items,
respectively.

The Estimated Item Statistics 

The estimated item statistics (both the p-values and point biserial correlations) were obtained 
using the 1PL model, the 3PL model and the standardization method, respectively. For both the 1PL 
and 3PL models, BILOG-MG (Zimowski, Muraki, Mislevy, & Bock, 1996) was utilized first to 
estimate the item parameters using the nonequivalent groups equating. Then, these item parameters 
were converted to the conventional p-values and point biserial correlations using the following 
formulas: 

$$p = \int \Pr(U = 1 \mid \theta)\, f_o(\theta)\, d\theta,$$

and

$$\rho = \frac{\sum_x \sum_u u\,x\,\Pr(U_o = u,\ Xc_o = x) - \sum_u u\,\Pr(U_o = u)\,\sum_x x\,\Pr(Xc_o = x)}{\sqrt{\sum_u u^2\,\Pr(U_o = u) - \left[\sum_u u\,\Pr(U_o = u)\right]^2}\;\sqrt{\sum_x x^2\,\Pr(Xc_o = x) - \left[\sum_x x\,\Pr(Xc_o = x)\right]^2}},$$

where Pr(U=1|θ) is the probability of correctly answering an item conditional on θ, which was
estimated based on the 1PL or 3PL model; fo(θ) is the ability distribution in the overall population;
and Pr(Uo=u, Xco=x), Pr(Uo=u) and Pr(Xco=x) are the respective distributions that were
estimated based on the 1PL or 3PL model.

Also based on the response data in each of the ten groups, the classical p-value and point 
biserial correlation were computed for each unique item. The standardization approach described 
above was used to estimate the conventional item statistics in the entire population. In this study, the
joint distribution of the unique and common items Pr(Ug=u, Xcg=x) was smoothed using two
bivariate polynomial log-linear models (Hanson, 1991; Rosenbaum & Thayer, 1987). The first
model used degree 4 for the common item score, degree 1 for the unique item score, and a degree 1
interaction term. The second model used degree 5 for the common item score, degree 1 for the
unique item score, and a degree 2 interaction term. The marginal distribution of the common items
Pr(Xco=x) was smoothed using a univariate polynomial log-linear model in Kolen (1991) with
polynomial degree 4. This smoothed marginal distribution was used with each of the two smoothed
bivariate distributions in each group to produce two estimates of the p-value and point biserial
correlation.

Replications 

The above process of estimating the item statistics was replicated 500 times for each of the three
methods.

The Criteria 

The population p-values and point biserial correlations were used as the baselines for 
evaluating the accuracy of the estimated p-values and point biserial correlations based on the 1PL 
model, the 3PL model, and the standardization method. Two indices were used as the criteria. One 
was the Pearson product-moment correlation coefficient between the estimated and population item 
statistics. The other criterion was the mean square error (MSE) over items. The MSE value is the 
expected squared difference between the estimated and population item statistics and can be 
decomposed into variance and squared bias. Variance is the average squared difference between the 
estimated and the expected value of the estimated item statistics across replications. Bias is the 
difference between the expected value of the estimated item statistics and the population value across 
replications. 
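
As a small illustration of the first criterion, the Pearson product-moment correlation between estimated and population item statistics can be computed as in the sketch below. The function name and the example values are hypothetical and are not results from this study.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical population and estimated p-values for five items.
population = [0.35, 0.50, 0.62, 0.71, 0.88]
estimated = [0.33, 0.52, 0.60, 0.74, 0.85]
print(round(pearson(estimated, population), 4))
```

A correlation near 1 indicates that the estimates preserve the ordering and relative spacing of the population values, but it does not by itself reveal a constant bias, which is why the MSE criterion is used alongside it.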

Provided below are the formulas used to compute the MSE, variance and squared bias for the
p-value over the 500 replications with respect to each of the unique items i:

$$\mathrm{MSE}_i = \frac{1}{500}\sum_{r=1}^{500} (\hat{p}_{ir} - p_i)^2,$$

$$\mathrm{Variance}_i = \frac{1}{500}\sum_{r=1}^{500} (\hat{p}_{ir} - \bar{p}_i)^2,$$

$$\mathrm{Bias}_i^2 = (\bar{p}_i - p_i)^2,$$

where r = 1, 2, ..., 500, $\hat{p}_{ir}$ is the estimated p-value of the unique item i for the rth replication, $\bar{p}_i$ is
the mean of the estimated p-values across the 500 replications, and $p_i$ is the population p-value of the
unique item i.

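For the point biserial correlations, the following formulas are used to compute the MSE,
variance and squared bias over the 500 replications with respect to each of the unique items i:

$$\mathrm{MSE}_i = \frac{1}{500}\sum_{r=1}^{500} (\hat{\rho}_{ir} - \rho_i)^2,$$

$$\mathrm{Variance}_i = \frac{1}{500}\sum_{r=1}^{500} (\hat{\rho}_{ir} - \bar{\rho}_i)^2,$$

$$\mathrm{Bias}_i^2 = (\bar{\rho}_i - \rho_i)^2,$$

where r = 1, 2, ..., 500, $\hat{\rho}_{ir}$ is the estimated point biserial correlation of the unique item i for the rth
replication, $\bar{\rho}_i$ is the mean of the estimated point biserial correlations across the 500 replications, and
$\rho_i$ is the population point biserial correlation of the unique item i.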
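
The decomposition of the MSE into variance and squared bias for a single item can be sketched as follows. The function name and the toy numbers are hypothetical; the divisor is the number of replications, which is what makes the three quantities add up exactly.

```python
def mse_decomposition(estimates, truth):
    """Return (MSE, variance, squared bias) for one item over replications.

    estimates: estimated statistic for the item in each replication.
    truth: population value of the statistic for the item.
    """
    R = len(estimates)
    mean_est = sum(estimates) / R
    mse = sum((e - truth) ** 2 for e in estimates) / R
    variance = sum((e - mean_est) ** 2 for e in estimates) / R
    sq_bias = (mean_est - truth) ** 2
    return mse, variance, sq_bias

# Toy example with four replications of one item (illustrative values only).
mse, var, sq_bias = mse_decomposition([0.48, 0.55, 0.51, 0.58], truth=0.50)
print(round(mse, 6), round(var + sq_bias, 6))  # the two values agree
```

The same function applies unchanged to the point biserial correlations, with the estimated and population correlations taking the place of the p-values.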

The average values of the MSE, variance and squared bias over items were computed as the 
criteria for the comparisons among the various methods. To facilitate the comparison among the 
average MSE for the various methods, the standard errors of the mean MSE over items are provided 
to indicate whether the differences among the average MSE values for the various methods were 
large relative to the errors introduced by estimating these averages by simulation using 500 
replications. The standard error of the mean MSE over items (i.e., the variability over the 500
replications of the average MSE over items) was computed from the replication-level values

$$\mathrm{MSE}_r = \frac{1}{240}\sum_{i=1}^{240} (\hat{p}_{ir} - p_i)^2$$

for the p-value and

$$\mathrm{MSE}'_r = \frac{1}{240}\sum_{i=1}^{240} (\hat{\rho}_{ir} - \rho_i)^2$$

for the point biserial correlation, where i = 1, 2, ..., 240 indexes the items and
$\overline{\mathrm{MSE}}$ is the average of the $\mathrm{MSE}_r$ values across the 500 replications.
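
One way such a standard error could be obtained from the replication-level values is sketched below. The sketch assumes the usual standard error of a mean with an n - 1 divisor for the variance across replications; that divisor, the function names, and the toy numbers are assumptions of the example and are not taken from the study.

```python
from math import sqrt

def replication_mse(estimates, truths):
    """Average squared error over items within one replication (MSE_r)."""
    return sum((e - t) ** 2 for e, t in zip(estimates, truths)) / len(truths)

def se_of_mean_mse(mse_by_replication):
    """Mean of the MSE_r values and an assumed standard error of that mean."""
    R = len(mse_by_replication)
    mean = sum(mse_by_replication) / R
    # Assumption: sample variance across replications with an R - 1 divisor.
    var = sum((m - mean) ** 2 for m in mse_by_replication) / (R - 1)
    return mean, sqrt(var / R)

# Toy example: three replications, two items (illustrative values only).
truths = [0.40, 0.70]
reps = [[0.42, 0.69], [0.37, 0.74], [0.41, 0.66]]
mse_r = [replication_mse(est, truths) for est in reps]
mean_mse, se = se_of_mean_mse(mse_r)
print(round(mean_mse, 6), round(se, 6))
```

Differences between methods in average MSE can then be judged against a standard error of this kind, as is done in the Results section.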



Results 

Described below are the results of the 1PL model, the 3PL model and the standardization 
method with respect to each of the criteria: the Pearson product-moment correlation, the MSE, 
variance and squared bias for both the p-values and point biserial correlations. For the 
standardization approach, the two bivariate smoothing methods produced similar results, so only the 
results for the model using a fourth degree polynomial for the common item score are presented. 

The Results in Terms of the Pearson Product-Moment Correlation

Summary statistics for the Pearson product-moment correlation coefficients between the 
estimated and population item statistics are displayed in Table 2 for the various methods. It can be 
seen that for the p-values, both the 1PL and standardization methods recovered the population
p-values to a similar degree. The average correlation for the 3PL model was only slightly higher than
that for the 1PL or standardization method. In fact, it may be concluded that these three methods 
performed equally well in recovering the population p-values. 

Table 2 shows that all methods performed less well in recovering the point biserial correlations than
the p-values. The employment of the 3PL model still resulted in the highest average correlation between
the estimated and population item statistics. The standardization method performed slightly worse 
than the 3PL model in recovering these correlations. It can be seen that the performance of the 1PL 
model was relatively poor among the three methods. 

The Results in Terms of the MSE, Variance and Squared Bias

Table 3 shows the summary statistics for p-value MSE, variance and squared bias over the 
240 items. It can be seen that the average MSE value was slightly lower for the 3PL model than for 
the standardization method. However, relative to the error in the estimates due to estimation by 
simulation with the 500 replications, the difference between these two methods was small. Therefore, 
the results did not provide a clear indication as to which method was better. 

The employment of the 1PL model led to the greatest average MSE over items (see Table 3). 
The difference between the 1PL model and either of the 3PL and standardization methods was large 
relative to the standard error. The results showed that the 3PL and standardization methods produced 
smaller overall error for estimating the p-values than the 1PL method. 

With respect to the variance, the average value over items was lower for the standardization
method than for the other two methods. The average squared bias for the 3PL model was relatively 
lower than for both the 1PL and standardization models. That the squared bias was lowest for the 
3PL model was not unexpected in that the data were generated with the 3PL model. The average 
values of the squared bias were similar for the 1PL and standardization approaches. 

Displayed in Table 4 are the summary statistics of the MSE, variance and squared bias over 
the 240 items for the point biserial correlations. There seem to be larger differences among the
methods with regard to these correlations than the p-values. The average MSE value was the lowest 
for the 3PL model. The standardization method produced a higher average MSE over the 240 items 
than the 3PL model. The average MSE was even higher for the 1PL model. The differences among 
the methods in average MSE were large relative to the standard errors of the average MSE. These 
findings seem to suggest that the 3PL model performed better overall than the standardization 
approach, and the standardization approach performed better overall than the 1PL model in terms of 
estimating the point biserial correlations. 

While the average variance over items for the standardization method was the lowest for the 
p-values (see Table 3), the average variance for the standardization method was the highest for the 
point biserial correlations in Table 4. The 1PL model had the lowest average value of the variance. 
For the squared bias, it can be seen that the average values for both the 3PL and standardization 
methods were substantially lower than that for the 1PL model. The squared bias was slightly lower 
for the 3PL model than for the standardization method, which was, again, not unexpected since the
population item parameters were generated with the 3PL model. 

Conclusions 

Based on the response data of the various examinee groups, IRT models can be employed to 
estimate item parameters and convert these parameters to be on the same scale using IRT scaling or 
transformation methods. However, when the examinee groups are small, employing traditional IRT 
item calibration or scaling may not be justified due to the large sample size requirement. This study 
explored the standardization approach to adjust conventional item statistics which has a less strict 
sample size requirement. The purpose of using the standardization method was to adjust the item 
statistics obtained from small nonequivalent samples to more closely represent the item statistics that 
would have been obtained if the population of interest had been employed. This method was
implemented by incorporating a set of common items across the various forms of a test and using the
assumption that the conditional distributions of unique or noncommon items given common items
were the same across all groups of examinees. 

The effectiveness of the standardization approach was compared with that of the 1PL and 3PL 
models using the Pearson product-moment correlation, the MSE, variance and squared bias as the 
criteria for evaluation. The results showed that in estimating the p-values, the differences in
performance between the 3PL and standardization methods were small relative to the standard
error, but the differences between the 1PL and the other two methods were large relative to the
standard error. For the estimation of the population point biserial correlations, the 3PL model
seemed to perform better than the standardization method, and the standardization method performed 
better than the 1PL model. 

The standardization method proposed in this study failed to outperform the 3PL model in
recovering the population point biserial correlations, but the data were generated using a 3PL model.
The relative performance of the standardization method and the 3PL model may differ when the data do
not perfectly fit a 3PL model. Also, the ten forms of the test were created using item parameters
calibrated based on the operational ACT Assessment Mathematics items of high quality. This could
affect the results of this study since variation in the item statistics of pretest items could be greater.
In addition, it remains unknown how robust the 3PL model would be to even smaller sample sizes.
While the classical item statistics might be stable based on samples as small as 150 to 200
examinees and the standardization method might still satisfactorily recover the population item
statistics, parameter estimation based on the IRT methods may not be justified and the performance
of the 3PL model could deteriorate. Further studies using smaller sample sizes could investigate
this issue.

Also, in this study each form consisted of 24 unique items and 12 common items. The results 
of this study may not generalize to situations where the number of pretest or common items differs 
from the values used in this study. Which of the methods would perform better in such situations
cannot simply be inferred from the findings of this study. Moreover, while group differences affect the
adjustment of the standardization method, they would also affect estimation with the IRT models. The
issue of the degree to which the methods are affected by group differences needs more consideration.
It would also be worthwhile to implement other IRT models, such as the 2PL model or a modified model,
and compare their effectiveness with that of the standardization approach.

The standardization procedure was carried out based on the assumption that the conditional 
distributions of unique or noncommon items given common items are the same across all groups of 
examinees. Factors such as the number and location of the embedded common items, the
representativeness of the common items of the entire test, and the item characteristics of these common
items might affect the viability of this assumption. The extent to which the
assumption held in the present study is not known.

The requirement of large sample sizes for calibrating items based on IRT models is not easily 
met in many practical pretesting situations. Although classical item statistics could be estimated with 
much smaller samples, the values may not be comparable across different groups of examinees. This 
study presented and evaluated a method that may be used by test practitioners to standardize classical 
item statistics when sample sizes are small. Although the standardization method performed slightly 
less well than the 3PL model for the design considered in the current study, this method of 
standardization could be promising when smaller sample sizes are used. This method may be 
recommended for use in combination with the IRT models for the test development when the 
pretesting sample sizes are small. By employing the classical measurement framework to obtain 
pretest item statistics, the problem of inaccurate IRT parameter estimates when limited calibration 
sample sizes are available can be avoided. 
Table 1. Summary Statistics of the Population Item Parameters for
the Common Items and the Unique Items for the Various Forms

The Common Items

Item Parameter    N      Mean        SD   Minimum   Maximum
a                12    0.9349    0.2658    0.5360    1.3540
b                12    0.0683    0.6553   -0.6980    1.3020
c                12    0.1698    0.0548    0.0960    0.3000

The Unique Items

Form    Item Parameter    N      Mean        SD   Minimum   Maximum
1       a                24    1.0079    0.2837    0.4960    1.6640
1       b                24   -0.0754    0.9047   -1.8650    1.7440
1       c                24    0.2005    0.0901    0.0570    0.4390
2       a                24    1.1303    0.3664    0.6140    1.8760
2       b                24   -0.0787    1.1617   -2.5220    1.5140
2       c                24    0.1805    0.0831    0.0570    0.4750
3       a                24    0.9085    0.3338    0.4200    1.7150
3       b                24   -0.0849    1.2512   -2.7130    2.0650
3       c                24    0.1651    0.0626    0.0430    0.3100
4       a                24    1.0865    0.4130    0.3270    2.1230
4       b                24    0.0100    1.3322   -3.7180    1.7940
4       c                24    0.1622    0.0622    0.0410    0.2640
5       a                24    1.2420    0.3410    0.5390    1.9630
5       b                24   -0.0134    0.8635   -2.3470    1.6220
5       c                24    0.1775    0.0635    0.0710    0.3050




Table 1. (Continued)

Form    Item Parameter    N      Mean        SD   Minimum   Maximum
6       a                24    1.1430    0.3739    0.6320    2.1210
6       b                24   -0.0845    0.8920   -2.2830    1.5000
6       c                24    0.1588    0.0563    0.0710    0.2640
7       a                24    1.0472    0.3817    0.4960    2.3790
7       b                24   -0.0241    0.8904   -1.6660    1.7780
7       c                24    0.1738    0.0840    0.0320    0.4320
8       a                24    0.9976    0.3075    0.5230    1.6130
8       b                24   -0.2165    1.1063   -2.2650    1.6680
8       c                24    0.1397    0.0530    0.0640    0.2590
9       a                24    1.2295    0.3962    0.6160    2.1010
9       b                24   -0.0744    1.2673   -3.4590    1.3490
9       c                24    0.1681    0.0760    0.0500    0.3430
10      a                24    1.0041    0.3205    0.6150    1.7140
10      b                24    0.0459    1.2908   -2.2920    2.4940
10      c                24    0.1744    0.0843    0.0830    0.4910

Table 2. Summary Statistics of the Correlations between the Estimated
and Population Item Statistics

p-Values

Method              N      Mean        SD   Minimum   Maximum
1PL               500   0.98990   0.00105   0.98657   0.99236
3PL               500   0.99121   0.00094   0.98802   0.99386
Standardization   500   0.98899   0.00132   0.98440   0.99254

Point Biserial Correlations

Method              N      Mean        SD   Minimum   Maximum
1PL               500   0.61584   0.01212   0.57734   0.64596
3PL               500   0.87582   0.01475   0.80593   0.91449
Standardization   500   0.82997   0.01739   0.77128   0.88271

Table 3. Summary Statistics for the Estimated p-value MSE, Variance
and Squared Bias for the Various Methods

MSE

Method              N      Mean        SD   Minimum   Maximum   Standard Error
1PL               240   0.00108   0.00049   0.00000   0.00063    0.00001487143
3PL               240   0.00095   0.00025   0.00000   0.00071    0.00001436668
Standardization   240   0.00096   0.00038   0.00000   0.00125    0.00000525862

Variance

Method              N      Mean        SD   Minimum   Maximum
1PL               240   0.00085   0.00025   0.00015   0.00132
3PL               240   0.00091   0.00023   0.00017   0.00132
Standardization   240   0.00073   0.00019   0.00012   0.00110

Squared Bias

Method              N      Mean        SD   Minimum   Maximum
1PL               240   0.00023   0.00035   0.00000   0.00212
3PL               240   0.00004   0.00006   0.00000   0.00032
Standardization   240   0.00023   0.00027   0.00000   0.00133



Table 4. Summary Statistics for the Estimated Point Biserial MSE, Variance
and Squared Bias for the Various Methods

MSE

Method              N      Mean        SD   Minimum   Maximum   Standard Error
1PL               240   0.00649   0.00968   0.00000   0.00308     0.0000609573
3PL               240   0.00253   0.00093   0.00000   0.00242     0.0000343220
Standardization   240   0.00378   0.00169   0.00000   0.00335     0.0000187347

Variance

Method              N      Mean        SD   Minimum   Maximum
1PL               240   0.00048   0.00014   0.00038   0.00134
3PL               240   0.00202   0.00054   0.00104   0.00457
Standardization   240   0.00311   0.00076   0.00165   0.00614

Squared Bias

Method              N      Mean        SD   Minimum   Maximum
1PL               240   0.00602   0.00968   0.00000   0.07004
3PL               240   0.00052   0.00076   0.00000   0.00703
Standardization   240   0.00067   0.00131   0.00000   0.01005
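
The mean values in Tables 3 and 4 can be read against the standard identity that mean squared error equals variance plus squared bias. The following is a minimal sketch, not part of the original analysis, that checks whether the reported means are consistent with that identity. The names `REPORTED`, `decomposition_gap` and `check_tables` are illustrative, and the tolerance of 2e-5 is an assumption chosen to allow for rounding to five decimal places.

```python
# Mean MSE, variance, and squared bias copied from Tables 3 and 4,
# stored as (MSE, variance, squared bias) for each method.
REPORTED = {
    "p-value": {
        "1PL": (0.00108, 0.00085, 0.00023),
        "3PL": (0.00095, 0.00091, 0.00004),
        "Standardization": (0.00096, 0.00073, 0.00023),
    },
    "point biserial": {
        "1PL": (0.00649, 0.00048, 0.00602),
        "3PL": (0.00253, 0.00202, 0.00052),
        "Standardization": (0.00378, 0.00311, 0.00067),
    },
}


def decomposition_gap(mse, variance, squared_bias):
    """Return MSE - (variance + squared bias); zero when the identity holds exactly."""
    return mse - (variance + squared_bias)


def check_tables(reported=REPORTED, tol=2e-5):
    """Flag, per statistic and method, whether the gap is within the rounding tolerance.

    The tolerance is an assumption: each table entry is rounded to five decimals,
    so the three rounded values may disagree by about one unit in the last place.
    """
    results = {}
    for statistic, methods in reported.items():
        for method, (mse, variance, squared_bias) in methods.items():
            gap = decomposition_gap(mse, variance, squared_bias)
            results[(statistic, method)] = abs(gap) <= tol
    return results


if __name__ == "__main__":
    for (statistic, method), ok in check_tables().items():
        status = "consistent" if ok else "inconsistent"
        print(f"{statistic:15s} {method:16s} {status}")
```

Read this way, the tables suggest that the 1PL model's larger MSE for the point biserial correlations comes mainly from squared bias, whereas for the standardization method the variance term dominates.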